COMPLETE ROADMAP: Building Text-to-Image & Image-to-Text Models
From Scratch → Production Service → Cutting-Edge Research
1. FOUNDATION PREREQUISITES
1.1 Mathematics (Non-Negotiable Core)
Linear Algebra
- Vectors, matrices, tensors (rank 0 → rank N)
- Matrix multiplication, dot products, outer products
- Eigenvalues, eigenvectors, SVD (Singular Value Decomposition)
- PCA (Principal Component Analysis) – used in latent space analysis
- Norms (L1, L2, Frobenius), distance metrics
- Jacobians and Hessians (for backpropagation)
Calculus
- Partial derivatives, chain rule (core of backprop)
- Gradient descent and its variants (intuition level)
- Taylor series approximations
- Integral calculus for probability distributions
- Multivariable optimization
Probability & Statistics
- Probability distributions: Gaussian, Bernoulli, Categorical, Beta, Dirichlet
- Bayesian inference: prior, likelihood, posterior
- KL Divergence, Jensen-Shannon Divergence
- Maximum Likelihood Estimation (MLE)
- ELBO (Evidence Lower BOund) – critical for VAEs
- Monte Carlo methods, importance sampling
- Markov chains and stationary distributions
Information Theory
- Entropy, cross-entropy, mutual information
- Rate-distortion theory
- Bits-back coding (used in compression-based generative models)
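These concepts show up directly in generative-model losses. As a small illustration (a sketch, not a required tool), here is the closed-form KL divergence between two diagonal Gaussians in NumPy; the same term appears later in the VAE's ELBO.

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), closed form for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

# Example: KL between an encoder posterior and the standard normal prior N(0, I)
mu, var = np.array([0.5, -0.3]), np.array([0.8, 1.2])
print(kl_diag_gaussians(mu, var, np.zeros(2), np.ones(2)))
```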
1.2 Programming Fundamentals
Python (Primary Language)
- OOP: classes, inheritance, decorators, metaclasses
- Functional programming: map, filter, lambda, closures
- Async/await, threading, multiprocessing
- Memory profiling and optimization
- Type hints and dataclasses
Scientific Python Stack
- NumPy: array broadcasting, vectorized ops, memory layouts
- SciPy: optimization, signal processing
- Matplotlib/Seaborn: visualization of training curves, attention maps
- Pandas: dataset management
- OpenCV: image read/write, color space conversion, augmentation
1.3 Deep Learning Framework Mastery
PyTorch (Recommended Primary)
- Tensor operations, autograd, computational graphs
- nn.Module, custom layers, hooks
- DataLoaders, custom Datasets, samplers
- Mixed precision training (torch.cuda.amp)
- Distributed training (torch.distributed, DDP)
- TorchScript, ONNX export
- torch.compile (PyTorch 2.0+)
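To make the mixed-precision bullet above concrete, here is a minimal training step with torch.cuda.amp, including the gradient clipping that usually accompanies it. The model, data, and hyperparameters are placeholders, and it assumes a CUDA GPU is available.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                   # scales the loss to avoid FP16 underflow
loss_fn = nn.CrossEntropyLoss()

for step in range(10):                                 # placeholder data loop
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                    # forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                      # backward on the scaled loss
    scaler.unscale_(optimizer)                         # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                             # optimizer step with inf/NaN check
    scaler.update()
```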
JAX (Optional but Powerful)
- Functional transformations: jit, grad, vmap, pmap
- XLA compilation
- Flax and Haiku as neural net libraries
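If you do try JAX, its core transformations compose cleanly. A toy sketch (illustrative only) of jit, grad, and vmap on a quadratic loss:

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)                 # toy quadratic loss

grad_fn = jax.jit(jax.grad(loss))                # compiled gradient w.r.t. w
batched = jax.vmap(loss, in_axes=(None, 0))      # map the loss over a batch of x

w = jnp.ones((3,))
xs = jnp.arange(6.0).reshape(2, 3)
print(grad_fn(w, xs[0]))   # gradient for one sample
print(batched(w, xs))      # per-sample losses
```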
TensorFlow / Keras
- Keras functional API, custom training loops
- TensorFlow Serving for production
2. STRUCTURED LEARNING PATH
PHASE 1: Classical Computer Vision (Weeks 1–6)
Week 1–2: Image Fundamentals
- Pixel representation (RGB, RGBA, grayscale, YCbCr, HSV)
- Image histograms, equalization, CLAHE
- Convolution, kernels: Gaussian blur, Sobel, Laplacian, Unsharp masking
- Fourier Transform for images (FFT, frequency domain filtering)
- Morphological operations: erosion, dilation, opening, closing
- Harris corner detection, SIFT, ORB keypoints
Week 3–4: Classical ML on Images
- SVM for image classification (HOG + SVM pipeline)
- K-means clustering for color quantization
- PCA for face recognition (Eigenfaces)
- Bag of Visual Words (BoVW)
- Random forests on feature descriptors
Week 5–6: Deep Learning for Vision (CNNs)
- LeNet-5 → AlexNet → VGG → GoogLeNet → ResNet progression
- Residual connections, bottleneck blocks, depthwise separable convolutions
- Batch normalization, layer normalization, group normalization
- Transfer learning and fine-tuning strategies
- Object detection: YOLO family, Faster R-CNN, SSD
- Semantic segmentation: FCN, U-Net, DeepLab
PHASE 2: Sequence Modeling & NLP (Weeks 7–12)
Week 7–8: RNNs and Language
- Vanishing gradient problem, LSTM, GRU internals
- Seq2Seq architecture with encoder-decoder
- Attention mechanism (Bahdanau, Luong)
- Word embeddings: Word2Vec (CBOW, Skip-gram), GloVe, FastText
- Byte Pair Encoding (BPE) tokenization
- WordPiece, SentencePiece tokenizers
Week 9–10: Transformer Architecture (Most Critical)
- Self-attention: Query, Key, Value matrices
- Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) * V (see the sketch after this list)
- Multi-head attention: parallel attention heads, head concatenation
- Positional encodings: sinusoidal (original), learned, RoPE, ALiBi
- Feed-forward sublayers, residual connections, LayerNorm
- Encoder-only (BERT-style), Decoder-only (GPT-style), Encoder-Decoder (T5-style)
- Flash Attention 1 & 2 (memory-efficient attention)
- Cross-attention (key mechanism linking text and image)
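Because cross-attention in diffusion U-Nets and VLM connectors is built from this same primitive, here is a minimal single-head scaled dot-product attention in PyTorch, written out explicitly; the shapes mimic image queries attending to text keys/values and are purely illustrative.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: [B, Lq, d], k/v: [B, Lk, d]. Returns [B, Lq, d]."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # [B, Lq, Lk]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)     # attention distribution over keys
    return weights @ v                      # weighted sum of values

# Cross-attention flavor: queries from image features, keys/values from text
img = torch.randn(2, 64, 512)   # e.g. 8x8 spatial positions, dim 512
txt = torch.randn(2, 77, 512)   # e.g. 77 text tokens
out = scaled_dot_product_attention(img, txt, txt)
print(out.shape)                # torch.Size([2, 64, 512])
```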
Week 11–12: Large Language Models
- Pre-training objectives: MLM, CLM, span corruption
- Fine-tuning: full fine-tune, LoRA, QLoRA, prefix tuning, prompt tuning
- RLHF (Reinforcement Learning from Human Feedback)
- DPO (Direct Preference Optimization)
- CLIP training: contrastive learning between text and image embeddings
PHASE 3: Generative Models (Weeks 13–22)
Week 13–14: Autoencoders & VAEs
- Vanilla Autoencoder: encoder, bottleneck, decoder
- Denoising Autoencoder, Sparse Autoencoder
- Variational Autoencoder (VAE):
- Reparameterization trick: z = μ + ε * σ
- ELBO loss = Reconstruction loss + KL divergence (see the loss sketch after this list)
- Posterior collapse problem and solutions
- Vector Quantized VAE (VQ-VAE):
- Codebook learning, commitment loss, straight-through estimator
- VQ-VAE-2: hierarchical latent codes
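A minimal sketch of the VAE objective referenced above: the reparameterization trick plus the two ELBO terms. The encoder and decoder are placeholder linear layers so the loss structure stays visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)   # placeholder encoder -> (mu, logvar), latent dim 16
dec = nn.Linear(16, 784)       # placeholder decoder

x = torch.rand(8, 784)                          # fake batch of flattened images in [0, 1]
mu, logvar = enc(x).chunk(2, dim=-1)
std = torch.exp(0.5 * logvar)
z = mu + torch.randn_like(std) * std            # reparameterization trick: z = mu + eps * sigma

recon = torch.sigmoid(dec(z))
recon_loss = F.binary_cross_entropy(recon, x, reduction="sum") / x.size(0)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)  # KL(q(z|x) || N(0, I))
loss = recon_loss + kl                          # minimize the negative ELBO
print(float(recon_loss), float(kl))
```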
Week 15–16: Generative Adversarial Networks (GANs)
- Original GAN: Generator vs Discriminator minimax game
- Training instabilities: mode collapse, vanishing gradients
- DCGAN (Deep Convolutional GAN)
- Conditional GAN (cGAN): conditioning on class labels
- Pix2Pix: image-to-image translation with L1 + adversarial loss
- CycleGAN: unpaired image-to-image translation
- StyleGAN / StyleGAN2 / StyleGAN3:
- Mapping network, AdaIN (Adaptive Instance Normalization)
- Progressive growing, path length regularization
- W-space and W+ space for editing
- BigGAN: class-conditional large-scale synthesis
- WGAN, WGAN-GP (Wasserstein loss, gradient penalty)
Week 17–20: Diffusion Models (The Current State-of-the-Art)
- Denoising Diffusion Probabilistic Models (DDPM):
- Forward process: q(x_t | x_{t-1}) = Gaussian noise schedule
- Reverse process: learn p_θ(x_{t-1} | x_t)
- Noise prediction network (U-Net backbone)
- Variance schedules: linear, cosine, sigmoid
- Score Matching:
- Stein score function: ∇_x log p(x)
- Denoising score matching objective
- Score-based generative models (NCSN)
- Stochastic Differential Equations (Score SDEs):
- VE-SDE (Variance Exploding), VP-SDE (Variance Preserving)
- Continuous-time diffusion framework
- Accelerated Sampling:
- DDIM (Denoising Diffusion Implicit Models): deterministic, fewer steps
- DPM-Solver, DPM-Solver++: ODE-based, 10–20 steps
- PNDM, UniPC, LCM (Latent Consistency Models)
- Flow Matching (Rectified Flow, Stable Diffusion 3)
- Latent Diffusion Models (LDM):
- Encode image to compressed latent space via VAE
- Run diffusion in latent space (4× or 8× spatial compression)
- Decode latent to image with VAE decoder
- This is the core of Stable Diffusion
- Conditioning Mechanisms:
- Class conditioning via embedding addition
- Text conditioning via cross-attention layers
- CLIP text encoder as condition signal
- Classifier-Free Guidance (CFG): ε_guided = ε_uncond + w * (ε_cond - ε_uncond)
- Classifier Guidance (original approach)
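At sampling time, classifier-free guidance is just the weighted combination above applied to two noise predictions per step. A hedged sketch with a generic eps_model callable (not any specific library's API):

```python
import torch

def cfg_noise(eps_model, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: combine conditional and unconditional noise predictions.

    eps_model is any noise-prediction network taking (latent, timestep, text_emb);
    all names here are placeholders, not a specific library API.
    """
    eps_uncond = eps_model(z_t, t, uncond_emb)
    eps_cond = eps_model(z_t, t, cond_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def cfg_noise_batched(eps_model, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Common trick: run both predictions in one batched forward pass."""
    eps = eps_model(torch.cat([z_t, z_t]), t, torch.cat([uncond_emb, cond_emb]))
    eps_uncond, eps_cond = eps.chunk(2)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```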
Week 21–22: Flow-Based and Other Generative Models
- Normalizing Flows: change-of-variables formula, invertible networks
- RealNVP, Glow, FFJORD
- Autoregressive Models: PixelCNN, VQ-VAE + transformer (DALL-E 1)
- Energy-Based Models (EBMs) and their connection to diffusion
- Consistency Models: distillation-based single-step generation
PHASE 4: Vision-Language Models (Weeks 23–30)
Week 23–24: CLIP and Contrastive Learning
- CLIP architecture: image encoder (ViT or ResNet) + text encoder (Transformer)
- Contrastive loss: InfoNCE, NT-Xent
- Zero-shot classification via CLIP
- CLIP embeddings as universal representation
- OpenCLIP, SigLIP, MetaCLIP variants
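The symmetric InfoNCE loss CLIP trains with is short enough to write out. This sketch assumes you already have paired image and text embeddings of the same batch size; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # [B, B] similarity matrix
    targets = torch.arange(img_emb.size(0))           # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```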
Week 25–26: Image Captioning (Image-to-Text)
- CNN + LSTM baseline (Show and Tell, 2015)
- CNN + Attention + LSTM (Show, Attend and Tell)
- Bottom-up, Top-down attention (Anderson et al.)
- ViT + GPT-2 prefix captioning
- BLIP (Bootstrapping Language-Image Pre-training):
- Image-text contrastive (ITC)
- Image-text matching (ITM)
- Image-conditioned text generation (LM)
- Bootstrapping with noisy web data
- BLIP-2: Q-Former architecture bridging frozen image encoder and frozen LLM
- LLaVA (Large Language and Vision Assistant)
Week 27–28: Text-to-Image (Full Pipeline)
- DALL-E 1: dVAE + GPT transformer autoregressive approach
- DALL-E 2: CLIP image embedding → diffusion decoder (unCLIP)
- Imagen: T5 text encoder + cascaded diffusion (pixel space)
- Stable Diffusion 1.x / 2.x:
- KL-reg VAE, U-Net with cross-attention, CLIP ViT-L/14
- Stable Diffusion XL (SDXL):
- Dual text encoders (CLIP ViT-L + OpenCLIP ViT-G)
- Base + Refiner two-stage pipeline
- Micro-conditioning (image size, crop)
- Stable Diffusion 3 / 3.5:
- Multimodal Diffusion Transformer (MMDiT)
- Flow Matching instead of DDPM
- Improved text rendering, composition
- Midjourney (proprietary), Adobe Firefly, FLUX (Black Forest Labs)
- FLUX.1: Rectified Flow Transformer, 12B parameters
Week 29–30: Multimodal LLMs
- Flamingo: perceiver resampler bridging vision and language
- GPT-4V, Claude 3 Vision, Gemini – architecture insights
- Phi-3 Vision, Idefics, InternVL
- CogVLM, Qwen-VL, MiniGPT-4
- Video understanding: Video-LLaMA, VideoChat
3. ALGORITHMS, TECHNIQUES & TOOLS
3.1 Core Algorithms
For Text-to-Image
| Algorithm | Year | Key Contribution |
|---|---|---|
| GAN (Goodfellow) | 2014 | Adversarial training paradigm |
| DCGAN | 2015 | Stable CNN-based GAN |
| VAE | 2013 | Latent variable generative model |
| VQ-VAE | 2017 | Discrete latent codes |
| DDPM | 2020 | Score-based diffusion |
| DDIM | 2020 | Fast deterministic sampling |
| CLIP | 2021 | Vision-language contrastive pre-training |
| DALL-E 1 | 2021 | Autoregressive text-to-image |
| LDM / Stable Diffusion | 2022 | Latent space diffusion |
| DALL-E 2 | 2022 | Diffusion with CLIP guidance |
| Imagen | 2022 | Cascaded diffusion with T5 |
| ControlNet | 2023 | Structural conditioning for diffusion |
| SDXL | 2023 | Improved architecture + dual encoders |
| Consistency Models | 2023 | Single-step generation |
| SD3 / FLUX | 2024 | Flow Matching + DiT architecture |
For Image-to-Text
| Algorithm | Year | Key Contribution |
|---|---|---|
| Show and Tell (NIC) | 2014 | CNN + LSTM captioning |
| Visual Attention | 2015 | Spatial attention for captions |
| Bottom-Up Features | 2018 | Object-level features (Faster R-CNN) |
| ViLBERT | 2019 | Dual-stream vision-language BERT |
| UNITER | 2019 | Universal image-text representation |
| CLIP | 2021 | Contrastive visual-language alignment |
| SimVLM | 2021 | PrefixLM for vision-language |
| BLIP | 2022 | Unified framework with bootstrapping |
| OFA | 2022 | Unified architecture for multiple tasks |
| BLIP-2 | 2023 | Q-Former + frozen LLM |
| LLaVA | 2023 | Visual instruction tuning |
| InstructBLIP | 2023 | Instruction tuning for BLIP-2 |
| LLaVA-1.5 | 2023 | MLP connector improvement |
| InternVL 2.5 | 2024 | State-of-the-art open-source VLM |
3.2 Key Techniques
Training Techniques
- Gradient Clipping: prevent exploding gradients (clip_grad_norm_)
- Learning Rate Schedulers: cosine annealing, OneCycleLR, warmup
- Mixed Precision Training: FP16/BF16 with loss scaling
- Gradient Checkpointing: trade compute for memory
- Exponential Moving Average (EMA): smoother model weights (see the sketch after this list)
- Data Augmentation: RandomCrop, RandomFlip, ColorJitter, RandAugment, CutMix, MixUp
- Label Smoothing, R-drop, Stochastic Depth
- Knowledge Distillation: teacher-student for smaller models
- Curriculum Learning: easy samples first, then hard ones
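The EMA weights mentioned above are kept as a shadow copy of the model that is updated after every optimizer step. A minimal sketch (model and decay value are placeholders):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                   # placeholder model being trained
ema_model = copy.deepcopy(model).eval()     # shadow copy used for eval/sampling
for p in ema_model.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)   # ema = decay * ema + (1 - decay) * param

# Call ema_update(ema_model, model) after each optimizer.step()
```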
Efficient Fine-tuning
- LoRA (Low-Rank Adaptation): inject trainable rank-decomposition matrices (see the sketch after this list)
- QLoRA: quantize base model to 4-bit, apply LoRA on top
- DreamBooth: personalization of diffusion models with 3–30 images
- Textual Inversion: learn new text token embedding
- IP-Adapter: image prompt via decoupled cross-attention
- ControlNet: zero-conv + locked copy of U-Net encoder
- T2I-Adapter: lighter alternative to ControlNet
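To make the LoRA bullet concrete: the pretrained weight stays frozen and a trainable low-rank update scaled by alpha/r is added on top. A minimal sketch of a LoRA-wrapped linear layer (a simplified stand-in, not the PEFT library's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```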
Inference Optimization
- Quantization: INT8, INT4, GPTQ, AWQ
- Pruning: magnitude-based, structured, lottery ticket
- Distillation: LCM (Latent Consistency Model) – 1–4 step inference
- TensorRT: NVIDIA inference engine
- ONNX Runtime: cross-platform inference
- DeepSpeed Inference, vLLM (for VLMs)
- Flash Attention 2: 2–4× speedup, reduced memory
- xFormers: memory-efficient attention operations
3.3 Essential Tools & Libraries
Model Development
- PyTorch – primary framework
- Hugging Face Transformers – pre-trained VLMs, LLMs
- Hugging Face Diffusers – diffusion model library (SDXL, FLUX, etc.)
- timm – PyTorch Image Models (300+ CNN/ViT architectures)
- OpenCLIP – open-source CLIP implementation
- accelerate – distributed training abstraction
- DeepSpeed – ZeRO optimizer, model parallelism
- PEFT – LoRA, prefix tuning, adapter methods
- bitsandbytes – 4-bit/8-bit quantization
Data & Dataset Tools
- datasets (Hugging Face) – load LAION, COCO, CC12M
- img2dataset – fast parallel image downloading
- webdataset – streaming large-scale datasets
- FFCV – high-throughput data loading
- Albumentations – fast image augmentation
Experiment Tracking
- Weights & Biases (wandb) – metrics, images, hyperparameter sweeps
- MLflow – open-source alternative
- TensorBoard – built into PyTorch/TensorFlow
- Aim – lightweight experiment tracker
Serving & Deployment
- FastAPI / Flask – REST API backends
- Triton Inference Server (NVIDIA) – high-performance model serving
- BentoML – MLOps packaging and serving
- Replicate – GPU cloud for model hosting
- Modal – serverless GPU deployment
- Gradio – quick demo UIs
- Streamlit – data app UIs
- Docker + Kubernetes – containerized deployment
- ONNX + TensorRT – optimized inference
Evaluation
- FID (Fréchet Inception Distance) – image quality metric
- CLIP Score – text-image alignment
- IS (Inception Score) – diversity and quality
- BLEU, ROUGE, CIDEr, METEOR – captioning metrics
- CLIPScore – reference-free captioning evaluation
- LPIPS – perceptual image similarity
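A reference-free CLIP score is just the cosine similarity between CLIP's image and text embeddings. A sketch using the Hugging Face transformers CLIP classes; the checkpoint name is one public option, and the solid red image stands in for a generated sample.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "red")      # placeholder for a generated image
caption = "a plain red square"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_score = (img_emb @ txt_emb.t()).item()      # cosine similarity in [-1, 1]
print(clip_score)
```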
4. DESIGN & DEVELOPMENT PROCESS
4.1 Text-to-Image: Full Build Process
STEP 0: Environment Setup
Hardware: RTX 3090/4090 (24GB VRAM) or A100/H100
OS: Ubuntu 22.04 LTS
CUDA: 12.1+, cuDNN 8.9+
Python: 3.10+ with pyenv or conda
Install: PyTorch 2.x, Diffusers, Transformers, accelerate
STEP 1: Data Pipeline
Dataset Selection
- LAION-400M / LAION-5B: 400M–5B image-text pairs (web-scraped)
- CC3M / CC12M: Conceptual Captions (cleaner, smaller)
- COYO-700M: high-quality image-text pairs
- JourneyDB: Midjourney-generated images for fine-tuning style
- Internal Dataset: scrape + filter your own domain-specific data
Data Processing Pipeline
1. Download raw URLs → img2dataset (parallel wget + resize)
2. Filter by CLIP similarity score (keep pairs > 0.28)
3. Aesthetic filtering: LAION Aesthetics Predictor V2
4. NSFW filtering: CLIP-based classifiers
5. Deduplication: perceptual hashing (pHash) or SSCD embeddings
6. Caption enrichment: re-caption with CogVLM/LLaVA for richer text
7. Store as WebDataset format (.tar shards) on S3/NFS
DataLoader Architecture
# WebDataset streaming pipeline
import webdataset as wds

dataset = (
wds.WebDataset(urls, shardshuffle=True)
.shuffle(1000)
.decode("pil")
.to_tuple("jpg", "txt")
.map(preprocess_sample)
.batched(batch_size)
)
STEP 2: VAE Training (Latent Compression)
Architecture
Encoder: Conv2d stack → ResBlocks → AttentionBlock → mean/logvar head
Bottleneck: 4-channel 64×64 latent (for 512×512 input, 8× compression)
Decoder: Linear projection → ResBlocks → AttentionBlock → Conv2d head
Discriminator: PatchGAN (for perceptual + adversarial loss)
Loss Function
L_total = L_reconstruction (L1 + perceptual)
+ KL_weight * L_KL
+ adv_weight * L_adversarial
+ L_discriminator
Training Config
Optimizer: Adam (lr=1e-4, β1=0.5, β2=0.9)
Batch size: 8–32 per GPU
Resolution: 256×256 initially, then 512×512
EMA: 0.999 decay
Precision: BF16
STEP 3: Text Encoder
- Use pretrained CLIP ViT-L/14 or OpenCLIP ViT-H/14 (frozen initially)
- Optionally add a frozen T5-XXL as a second text encoder (for better text rendering)
- Text tokenization: max 77 tokens (CLIP), or 128/512 (T5)
- Output: sequence of text embeddings [batch, seq_len, dim]
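For STEP 3, a short sketch of pulling per-token embeddings from a pretrained CLIP text encoder with transformers; this produces the [batch, seq_len, dim] tensor the U-Net's cross-attention consumes. The checkpoint name matches SD 1.x but is interchangeable.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompts = ["a photorealistic cat on a red sofa", "an astronaut riding a horse"]
tokens = tokenizer(prompts, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(**tokens).last_hidden_state

print(text_embeddings.shape)   # torch.Size([2, 77, 768]) for CLIP ViT-L/14
```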
STEP 4: U-Net Diffusion Model
Architecture (Stable Diffusion-style)
Input: Noisy latent z_t [B, 4, 64, 64]
Time embedding: sinusoidal → MLP → added to ResBlocks
Encoder path:
DownBlock (ResBlock + SpatialAttention + CrossAttention) × 4
Bottleneck:
ResBlock + SpatialAttention + CrossAttention
Decoder path:
UpBlock (ResBlock + SpatialAttention + CrossAttention + skip) × 4
Output: Predicted noise ε [B, 4, 64, 64]
Cross-Attention: Q from image features, K,V from text embeddings
Training Objective (DDPM)
L_simple = E[ ||ε - ε_θ(z_t, t, τ_θ(y))||² ]
where:
z_t = √ᾱ_t * z_0 + √(1-ᾱ_t) * ε (forward process)
ε ~ N(0, I)
τ_θ(y) = text encoder output
t ~ Uniform(1, T)
CFG Training (10–20% unconditional)
if random.random() < 0.1:
    text_embeddings = uncond_embeddings  # empty/null condition
STEP 5: DiT Architecture (Modern Approach)
Diffusion Transformer (SD3/FLUX Style)
Input: Patchified latent [B, num_patches, dim]
Text: Separate token sequence
Architecture: Alternating self-attention + cross-attention (MMDiT)
or full joint attention (FLUX)
Scalable: 600M → 8B → 12B parameters
Position encoding: 2D RoPE
STEP 6: Training Strategy
Stage 1: Low Resolution (256×256)
Steps: 200K
Batch: 2048 (across GPUs)
LR: 1e-4 with 10K warmup
Noise schedule: Linear (T=1000)
Stage 2: High Resolution (512×512 or 1024×1024)
Steps: 500K–1M
Batch: 1024–4096
Multi-aspect ratio training
Fine-tune VAE jointly (optional)
Stage 3: Instruction / Aesthetic Fine-tuning
DreamBooth fine-tuning for style
Human feedback data with reward model
RLHF or DPO on preference data
STEP 7: Reverse Engineering Approach (Start from SDXL)
If building from scratch is too resource-intensive, reverse engineer:
1. Load SDXL weights from Hugging Face (2.6B-parameter U-Net)
2. Inspect model architecture: model.unet.config
3. Trace forward pass with torch.fx or hooks
4. Identify cross-attention layers β replace text encoder
5. Add ControlNet: copy encoder half, add zero_convs
6. Fine-tune on custom data with DreamBooth/LoRA
7. Quantize to INT8 with bitsandbytes or GPTQ
8. Export to ONNX → TensorRT for deployment
4.2 Image-to-Text: Full Build Process
STEP 1: Choose Architecture Paradigm
Option A: Frozen CLIP + Trainable MLP + Frozen LLM (LLaVA-style)
Option B: Trainable ViT + Q-Former + Frozen LLM (BLIP-2 style)
Option C: Full multimodal transformer (Flamingo, Gemini-style)
STEP 2: Vision Encoder Setup
# Option: Load pre-trained ViT
from transformers import CLIPVisionModel
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
# Freeze encoder initially
for param in vision_encoder.parameters():
    param.requires_grad = False
STEP 3: Vision-Language Connector
Simple MLP Connector (LLaVA-1.5)
# Project visual features to LLM token space
import torch.nn as nn

connector = nn.Sequential(
nn.Linear(vision_dim, llm_dim),
nn.GELU(),
nn.Linear(llm_dim, llm_dim)
)
Q-Former (BLIP-2)
Learnable query tokens [32 × 768]
Self-attention among queries
Cross-attention to image patches
Feed image features → get 32 compressed query outputs
Project to LLM embedding dimension
STEP 4: Language Model Integration
Choose base LLM: LLaMA-3.1 8B, Mistral 7B, Qwen2.5 7B, Phi-3
Concatenate: [visual tokens] + [text tokens] → LLM
Training: Autoregressive cross-entropy on text tokens only
STEP 5: Training Stages (LLaVA Protocol)
Stage 1 (Pretraining):
- Freeze ViT + Freeze LLM
- Train only MLP connector
- Data: 558K image-text pairs (CC3M filtered)
- 1 epoch, ~3 hours on 8×A100
Stage 2 (Instruction Tuning):
- Unfreeze LLM (full or LoRA)
- Keep ViT frozen (or unfreeze top layers)
- Data: LLaVA-Instruct 665K visual conversations
- 1 epoch, ~15 hours on 8×A100
STEP 6: Data for Image Captioning / VQA
Pretraining Data:
- LAION-COCO: 600M synthetic captions
- CC3M, CC12M, SBU Captions
- COYO-700M
Instruction Tuning Data:
- LLaVA-Instruct-150K / 665K
- TextVQA, VQAv2, GQA, OK-VQA
- NoCaps, Flickr30k, COCO Captions
- ShareGPT4V (high-quality GPT-4V captions)
- ALLaVA, LVIS-Instruct4V
STEP 7: Evaluation Benchmarks
Captioning: COCO captions (CIDEr, SPICE)
VQA: VQAv2, TextVQA, DocVQA
Understanding: MMBench, MME, SEED-Bench
OCR: OCRBench, ChartQA
Hallucination: POPE, HallusionBench
Reasoning: ScienceQA, MathVista
5. WORKING PRINCIPLES, ARCHITECTURE & HARDWARE
5.1 Working Principles
Diffusion (Text-to-Image)
Forward Process (Data → Noise)
q(x_t | x_0) = N(x_t; √ᾱ_t * x_0, (1 - ᾱ_t) * I)
At t=T, x_T ≈ N(0, I) – pure Gaussian noise
Reverse Process (Noise → Data)
Start from x_T ~ N(0,I)
Iteratively denoise: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ, σ_θ²)
U-Net predicts noise ε_θ(x_t, t, c) given noisy image, timestep, condition
At end: x_0 = clean generated image
Why it works: Neural network learns the gradient of the data distribution (score function), gradually pushing noisy samples back toward the data manifold.
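Because q(x_t | x_0) has the closed form above, training batches can be noised at any timestep directly, which is exactly how the noise-prediction target is constructed. A minimal sketch with a linear beta schedule (values illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear variance schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # cumulative ᾱ_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(ᾱ_t) * x_0, (1 - ᾱ_t) * I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)       # broadcast over [B, C, H, W]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

x0 = torch.randn(4, 4, 64, 64)     # a batch of clean latents
t = torch.randint(0, T, (4,))      # a random timestep per sample
x_t, eps = q_sample(x0, t)         # noisy latents plus the noise the network must predict
print(x_t.shape)
```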
Cross-Attention (Text Conditioning)
Text features (from CLIP/T5): K and V matrices
Image features (spatial): Q matrix
Attention = softmax(Q·K^T / √d) · V
Each spatial position attends to all text tokens
This is HOW text guides the image generation
Flow Matching (Modern Alternative to DDPM)
Instead of noise prediction, learn a velocity field v_θ(x_t, t)
Straight paths from noise → data (rectified flows)
ODE: dx/dt = v_θ(x_t, t)
Advantages: fewer steps, more stable training, better quality
Used in: SD3, FLUX, Lumina
Vision-Language Alignment (Image-to-Text)
Image → patches → ViT tokens (e.g., 576 tokens for a 336×336 input with 14×14 patches)
Text → tokenizer → embedding lookup
Tokens from both modalities flow through transformer
Causal masking on text, bidirectional on image
LLM generates text tokens auto-regressively conditioned on image
5.2 Architecture Reference
U-Net Diffusion Model (SD 1.x/2.x/XL)
Params: 860M (SD1.4), 860M (SD2.1), 2.6B (SDXL)
Input resolution: 64×64 latents for 512px images (128×128 for 1024px)
Attention resolutions: 8, 16, 32 (spatial sizes)
Channels: 320 base (SD1.x), 320/640/1280 (SDXL)
Transformer depth per block: 1 (SD1.x), 1/2/10 (SDXL)
Text cross-attention dim: 768 (CLIP), 2048 (OpenCLIP)
Time embedding dim: 1280
DiT (Diffusion Transformer – SD3/FLUX)
FLUX.1 dev: 12B params, 19 double-stream + 38 single-stream blocks
Patch size: 2Γ2 (16 latent channels)
Hidden dim: 3072 (FLUX), 4096 (DiT-XL)
Heads: 24
Sequence length: 4096 (image) + 77/256 (text)
Joint attention: image + text tokens attend to each other simultaneously
BLIP-2 Architecture
ViT-L/14: 307M params (frozen)
Q-Former: 188M params (trainable)
- 32 learnable query tokens
- 12 transformer layers
- Self + Cross attention
LLM: OPT-2.7B / OPT-6.7B / FlanT5-XL (frozen)
Total trainable params at stage 1: ~188M (only Q-Former)
LLaVA-1.5 Architecture
ViT: CLIP ViT-L/14 @ 336px → 576 visual tokens
Connector: 2-layer MLP with GELU
LLM: Vicuna-7B or Vicuna-13B (LLaMA-2 based)
Visual tokens prepended to text: [IMG_TOKENS] [INST_TOKENS]
5.3 Hardware Requirements
Development / Research
| Model Type | Min GPU | Recommended | VRAM | Training Time |
|---|---|---|---|---|
| Fine-tune SD 1.5 LoRA | RTX 3060 | RTX 4090 | 8GB | Hours |
| Full SD 1.5 DreamBooth | RTX 3090 | RTX 4090 | 24GB | 1–2 hours |
| Train SD from scratch | 8×A100 | 64×A100 | 80GB×8 | Weeks |
| Fine-tune BLIP-2 | RTX 4090 | A100 80GB | 24–40GB | Days |
| Train LLaVA-1.5 (7B) | 8×A100 | 8×A100 | 80GB×8 | ~12 hours |
| Fine-tune LLaVA LoRA | RTX 4090 | A100 | 24GB | Hours |
| FLUX.1 Inference | RTX 4090 | A100 | 24GB | n/a |
| FLUX.1 Fine-tune | 4×A100 | 8×A100 | 80GB×4 | Days |
Cloud Platforms
AWS: p3.16xlarge (8×V100), p4d.24xlarge (8×A100), p5.48xlarge (8×H100)
GCP: a2-highgpu-8g (8×A100 40GB), a3-highgpu-8g (8×H100)
Azure: NDv4 (8×A100), NDv5 (8×H100)
Lambda Labs: GPU cloud, cheaper than AWS/GCP
RunPod: spot GPU instances, cheapest option
Vast.ai: peer-to-peer GPU marketplace
Local Setup (Minimum Viable)
Text-to-Image inference (SD 1.5): RTX 3060 12GB
Text-to-Image inference (SDXL): RTX 3090/4090 24GB
Image-to-Text inference (LLaVA 7B): RTX 3090 24GB (or 2×16GB)
Image-to-Text inference (LLaVA 13B): 2×RTX 3090 or A6000 48GB
Fine-tuning with LoRA (most models): RTX 4090 24GB
Storage: 2TB NVMe SSD minimum for datasets and models
RAM: 64GB+ recommended
CPU: 16+ cores for data preprocessing
Optimal Training Cluster
Nodes: 4–16 machines
Per node: 8×H100 80GB SXM5
Interconnect: NVLink (within node), InfiniBand HDR/NDR (between nodes)
Storage: Parallel file system (Lustre, GPFS, or NFS on SSD RAID)
Networking: 400Gb/s InfiniBand
Software: NCCL for collective communications
6. CUTTING-EDGE DEVELOPMENTS (2024–2025)
6.1 Text-to-Image Frontier
Architecture Innovations
- FLUX.1 (Black Forest Labs, 2024): 12B rectified flow transformer, state-of-the-art open weights for T2I; superior text rendering and photorealism
- Stable Diffusion 3.5 Large: MMDiT-X with improved conditioning and quality
- Lumina-T2X: Flow-based DiT with Next-DiT blocks, dynamic resolution
- PixArt-Σ: Ultra-high resolution (4K) efficient T2I transformer
- HiDiffusion: Training-free approach for arbitrary resolution generation
- SynCamMaster: Multi-camera video generation with synchronized views
Video Generation (Extension of T2I)
- Sora (OpenAI): Spacetime patch-based video diffusion
- Wan 2.1 (Alibaba): Open-source video generation, 14B params
- Kling (Kuaishou): High-quality video gen with motion control
- HunyuanVideo (Tencent): 13B params, open weights video model
- CogVideoX: DiT-based open video generation model
- Mochi-1: 10B diffusion transformer for video
Editing & Control Advances
- InstructPix2Pix: Edit images with text instructions
- MasaCtrl: Training-free consistent image editing
- IP-Adapter FaceID: Identity-preserving generation
- InstantID: Single-image ID-preserving generation with ControlNet
- PhotoMaker V2: Style-consistent person generation
- ELLA: LLM-enhanced CLIP for better prompt adherence
Speed & Efficiency
- LCM (Latent Consistency Model): 4-step generation, 10× faster
- LCM-LoRA: Apply consistency distillation as LoRA adapter
- SDXL-Lightning: 1–4 step adversarial diffusion distillation
- Hyper-SD: Trajectory-segmented consistency distillation
- TurboEdit: Real-time image editing in 1–2 diffusion steps
6.2 Image-to-Text / VLM Frontier
Model Releases (2024–2025)
- LLaVA-OneVision: Multi-image, multi-granularity understanding
- InternVL 2.5: Top open-source VLM, beats many proprietary models
- Qwen2.5-VL: Strong open-source VLM with video understanding
- Phi-3.5 Vision: Efficient VLM (4B params) for edge deployment
- MiniCPM-V 2.6: 8B model with GPT-4V level capability
- Pixtral 12B (Mistral): First open multimodal Mistral model
- Molmo (Allen AI): Open VLM trained on human-annotated data
- Cambrian-1: Spatial vision-centric VLM benchmark
Technical Innovations
- Dynamic Resolution: Process any aspect ratio without distortion (LLaVA-HD, InternVL)
- Pixel Shuffle / AnyRes: Efficient high-resolution image encoding
- Chain-of-Thought Visual Reasoning: R1-style reasoning for VLMs
- Grounding + Captioning: Unified models for detection + description
- Document Understanding: DocVQA, chart/table parsing (DocOwl, mPLUG-DocOwl 1.5)
- Dense Prediction + Language: SAM 2 + LLM for segmentation + description
6.3 Emerging Paradigms
- World Models: GAIA-1, Genie, UniSim – understanding the physical world through generation
- Unified Any-to-Any Models: Unified-IO 2, NExT-GPT – any modality in, any out
- Test-Time Compute: Using more compute at inference (R1-style for vision)
- Synthetic Data Pipelines: Generate training data with T2I for downstream tasks
- 3D Generation: Zero123++, One-2-3-45, Stable Zero123, OpenLRM, InstantMesh
7. PROJECT BUILD IDEAS (BEGINNER → ADVANCED)
BEGINNER LEVEL (Learn Core Concepts)
Project 1: MNIST Variational Autoencoder Beginner
Goal: Understand latent spaces and generation
Stack: PyTorch, matplotlib
Features: Encode digits to 2D latent, sample and decode
Learning: VAE math, reparameterization trick, ELBO
Time: 1–2 days
Project 2: CIFAR-10 DCGAN Beginner
Goal: Build your first GAN
Stack: PyTorch, WandB
Features: Generate 32×32 images, training curves
Learning: GAN training dynamics, mode collapse debugging
Time: 2–3 days
Project 3: Basic Image Captioning with BLIP Beginner
Goal: Run inference with pre-trained model
Stack: Transformers, Gradio
Features: Upload image → get captions
Learning: VLM inference, tokenization, beam search
Time: 1 day
Project 4: Text-to-Image with Diffusers Beginner
Goal: Generate images from text prompts
Stack: Diffusers, SDXL weights
Features: Prompt → image, CFG scale control
Learning: Diffusion inference pipeline, sampling schedulers
Time: 1 day
INTERMEDIATE LEVEL (Build Real Features)
Project 5: Custom Image Captioning Dataset + Fine-tuning Intermediate
Goal: Fine-tune BLIP-2 on domain-specific data (e.g., medical images, fashion)
Stack: Transformers, PEFT, WandB
Features: Custom dataset loader, LoRA fine-tuning, evaluation with CIDEr
Learning: Data pipelines, VLM fine-tuning, evaluation metrics
Time: 1–2 weeks
Project 6: Personal DreamBooth Model Intermediate
Goal: Fine-tune Stable Diffusion to generate images of yourself
Stack: Diffusers, accelerate, wandb
Features: 15 personal photos → custom model, prompt: "photo of [V] person"
Learning: DreamBooth training, prior preservation loss, overfitting mitigation
Time: 3–5 days
Project 7: ControlNet Application Intermediate
Goal: Build a pose-conditioned image generator
Stack: Diffusers, ControlNet-OpenPose, MediaPipe
Features: Webcam → pose → generate person in pose
Learning: Structural conditioning, ControlNet architecture
Time: 1 week
Project 8: Image Search Engine with CLIP Intermediate
Goal: Search 100K images with natural language
Stack: CLIP, FAISS, FastAPI, React frontend
Features: "red sports car sunset" → top 20 matching images
Learning: Embedding spaces, vector search, cosine similarity
Time: 1–2 weeks
Project 9: Visual QA Chatbot Intermediate
Goal: Build a chatbot that answers questions about images
Stack: LLaVA/BLIP-2, FastAPI, Gradio
Features: Multi-turn conversation about uploaded images
Learning: Multi-turn VLM inference, conversation templates
Time: 1 week
Project 10: Aesthetic Image Scorer + Filter Intermediate
Goal: Auto-filter dataset by aesthetic quality
Stack: CLIP, aesthetic predictor MLP, WandB
Features: Score images 1–10, batch filter pipeline
Learning: CLIP embeddings, linear probing, dataset curation
Time: 3–5 days
ADVANCED LEVEL (Research & Production)
Project 11: Train Latent Diffusion Model from Scratch Advanced
Goal: Train a small LDM (256px) on a custom domain
Stack: PyTorch, accelerate, DeepSpeed, WandB, WebDataset
Features: Custom VAE, UNet, CLIP conditioning, full training loop
Learning: Large-scale training, distributed training, EMA, FID evaluation
Hardware: 4–8×A100 or 4–8×4090
Time: 2–4 weeks
Project 12: Fine-tune LLaVA on Medical Imaging Advanced
Goal: Build a medical image description VLM
Stack: LLaVA codebase, DeepSpeed, MIMIC-CXR dataset
Features: Chest X-ray → radiology report generation
Learning: Medical VLM, clinical NLP evaluation, HIPAA considerations
Time: 2–3 weeks
Project 13: Build a LoRA Marketplace Advanced
Goal: Platform to create, share, and use LoRA adapters
Stack: FastAPI, React, PostgreSQL, S3, Diffusers, GPU worker queue
Features: Upload training images → auto-train LoRA → share/sell
Learning: MLOps, async task queues (Celery/Redis), GPU job scheduling
Time: 1–2 months
Project 14: Real-Time Image Editing API Advanced
Goal: Production text-guided image editing service
Stack: InstructPix2Pix / TurboEdit, TensorRT, FastAPI, WebSocket
Features: Upload image + instruction → edited image in <3 seconds
Learning: Model optimization, TensorRT export, streaming results
Hardware: A100 or H100 for low latency
Time: 3–4 weeks
Project 15: Multimodal RAG System Advanced
Goal: Retrieve and reason over images + text documents
Stack: LLaVA, CLIP, FAISS, LLaMA, LangChain, FastAPI
Features: Mixed document store → query → retrieve relevant images/text → LLM answers
Learning: RAG architecture, multimodal retrieval, hybrid search
Time: 3–5 weeks
Project 16: Video Captioning Pipeline Advanced
Goal: Auto-caption videos for accessibility/SEO
Stack: CogVideoX or InternVL, FFmpeg, Whisper, FastAPI
Features: Video → extract frames → caption + transcribe → rich description
Learning: Temporal understanding, video VLMs, pipeline orchestration
Time: 2–3 weeks
8. BUILDING & DEPLOYING YOUR OWN SERVICE
8.1 Service Architecture
Microservices Design
+-----------------------------------------------+
|              API Gateway (nginx)               |
+------------+-----------------------+-----------+
             |                       |
     +-------v-------+       +-------v-------+
     |  T2I Service  |       |  I2T Service  |
     |   (FastAPI)   |       |   (FastAPI)   |
     +-------+-------+       +-------+-------+
             |                       |
     +-------v-------+       +-------v-------+
     |  GPU Worker   |       |  GPU Worker   |
     |   (Celery)    |       |   (Celery)    |
     +-------+-------+       +-------+-------+
             |                       |
     +-------v-----------------------v-------+
     |           Redis (Task Queue)          |
     +-------------------+-------------------+
                         |
                 +-------v-------+
                 |  PostgreSQL   |  (Jobs, Users, Results)
                 +-------+-------+
                         |
                 +-------v-------+
                 |  S3 / MinIO   |  (Images, Models)
                 +---------------+
REST API Design
Text-to-Image Endpoint
POST /v1/generate
{
"prompt": "a photorealistic cat on a red sofa",
"negative_prompt": "blurry, low quality",
"width": 1024,
"height": 1024,
"num_inference_steps": 28,
"guidance_scale": 7.5,
"seed": 42,
"model": "sdxl"
}
Response:
{
"job_id": "abc-123",
"status": "queued",
"eta_seconds": 8
}
GET /v1/jobs/{job_id}
Response:
{
"status": "complete",
"image_url": "https://cdn.yourservice.com/...",
"generation_time": 4.2
}
Image-to-Text Endpoint
POST /v1/caption
{
"image_url": "https://...", // or base64
"task": "detailed_caption", // or "vqa", "ocr"
"question": "What objects are in this image?" // for VQA
}
Response:
{
"caption": "A golden retriever sits on a...",
"confidence": 0.94,
"processing_time": 1.2
}
8.2 Model Optimization for Production
Quantization Pipeline
# GPTQ quantization (for LLaVA LLM part)
import torch
from transformers import AutoModelForCausalLM, GPTQConfig
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
# BitsAndBytes 4-bit
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
TensorRT Export (for T2I)
# Export SDXL UNet to TensorRT
from polygraphy.backend.trt import TrtRunner
# Use torch2trt or Hugging Face optimum-nvidia
from optimum.nvidia import AutoModelForCausalLM  # for the VLM's LLM part
Batching Strategy
T2I: Usually batch=1 (high VRAM per image), use request queuing
I2T: Can batch 4–8 requests (captioning is faster than generation)
Dynamic batching: Triton Inference Server handles this automatically
8.3 Monitoring & Observability
Key Metrics to Track
Generation latency (P50, P95, P99)
Queue depth (pending jobs)
GPU utilization per worker
VRAM usage
Cache hit rate (same prompts)
Error rate (OOM, timeout, etc.)
Cost per generation
User quality scores (thumbs up/down)
Tools
- Prometheus + Grafana: infrastructure metrics
- Sentry: error tracking
- OpenTelemetry: distributed tracing
- Datadog / New Relic: APM
- Custom: log generation params + user ratings to PostgreSQL for fine-tuning feedback
8.4 Cost Optimization
Strategies
- Spot/preemptible instances: 60–80% cheaper (handle interruptions gracefully)
- Model distillation: LCM reduces steps 30 → 4, ~8× cost reduction
- Quantization: 4-bit reduces VRAM 4×, fit more on cheaper GPUs
- Caching: Exact prompt cache (Redis), semantic cache (FAISS + threshold) (see the sketch after this list)
- Batching: Maximize GPU utilization
- Cold start management: Keep 1 warm instance, scale 0 → N on demand
- Regional pricing: Use cheaper AWS regions (us-east-2 vs us-west-2)
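The caching strategy above can be as simple as keying Redis on a hash of the full generation request. A sketch that assumes a local Redis instance and an existing generate() callable, both placeholders:

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_generate(params: dict, generate, ttl_seconds=86400):
    """Exact-match prompt cache: identical requests return the stored image bytes."""
    key = "t2i:" + hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit                        # cache hit: skip the GPU entirely
    image_bytes = generate(**params)      # placeholder for your actual diffusion call
    r.set(key, image_bytes, ex=ttl_seconds)
    return image_bytes
```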
Estimated Costs (2024)
SDXL on A100 80GB: ~300 images/hour → $0.005–0.01 per image
LLaVA-7B on A100: ~500 captions/hour → $0.002–0.005 per caption
With quantization + LCM: 5–10× cost reduction possible
8.5 Safety & Content Moderation
NSFW / Safety Filters
- Input: Prompt safety classifier (fine-tuned BERT on harmful prompts)
- Output: NSFW image classifier (e.g., Falconsai/nsfw_image_detection)
- Watermarking: Stable Signature, invisible watermarks for generated images
- Rate limiting: Per-user and per-IP limits
- Logging: All generations logged for abuse review
REFERENCE PAPERS (Must-Read)
Foundational
- "Auto-Encoding Variational Bayes" β Kingma & Welling (2013)
- "Generative Adversarial Nets" β Goodfellow et al. (2014)
- "Attention Is All You Need" β Vaswani et al. (2017)
- "An Image is Worth 16x16 Words" β Dosovitskiy et al. (ViT, 2020)
Diffusion Models
- "DDPM" β Ho et al. (2020) | arXiv: 2006.11239
- "DDIM" β Song et al. (2020) | arXiv: 2010.02502
- "Score-Based Generative Modeling through SDEs" β Song et al. (2021)
- "Latent Diffusion Models" β Rombach et al. (2022) | arXiv: 2112.10752
- "Scalable Diffusion Models with Transformers (DiT)" β Peebles & Xie (2022)
- "Flow Matching for Generative Modeling" β Lipman et al. (2022)
- "Consistency Models" β Song et al. (2023)
Vision-Language
- "CLIP" β Radford et al. (2021) | arXiv: 2103.00020
- "BLIP" β Li et al. (2022) | arXiv: 2201.12086
- "BLIP-2" β Li et al. (2023) | arXiv: 2301.12597
- "LLaVA" β Liu et al. (2023) | arXiv: 2304.08485
- "LLaVA-1.5" β Liu et al. (2023) | arXiv: 2310.03744
- "Flamingo" β Alayrac et al. (2022) | arXiv: 2204.14198
Control & Editing
- "ControlNet" β Zhang & Agrawala (2023) | arXiv: 2302.05543
- "InstructPix2Pix" β Brooks et al. (2022) | arXiv: 2211.09800
- "DreamBooth" β Ruiz et al. (2022) | arXiv: 2208.12242
COMMUNITY & RESOURCES
Online Platforms
- Hugging Face Hub: Models, datasets, Spaces demos
- Papers With Code: Implementation + benchmarks
- arXiv cs.CV + cs.LG: Latest papers
- Civitai: Community SD models, LoRAs
- GitHub: Diffusers, LLaVA, ComfyUI, A1111
Key Courses
- Fast.ai Part 2: Deep learning from foundations
- DeepLearning.AI Specialization: Andrew Ng (Coursera)
- Stanford CS231n: CNN for Visual Recognition
- Stanford CS224N: NLP with Deep Learning
- Hugging Face Courses: Diffusion Models, NLP, RL
Communities
- Reddit: r/LocalLLaMA, r/StableDiffusion, r/MachineLearning
- Discord: Hugging Face, Stable Diffusion, EleutherAI
- Twitter/X: Follow @hardmaru, @karpathy, @sama, @rivershavewings